Vector Quantized VAEs
Overview
VQ-VAE replaces the continuous Gaussian latent with a discrete codebook lookup: the encoder output is “snapped” to the nearest learned embedding vector. This yields cleaner compression and makes the latent space easy to model with autoregressive models (such as PixelCNN or GPT-style transformers) for generation. The key intuition to take away: as z_e moves slowly across the codebook space, the decoded output stays perfectly still, then snaps the instant z_e crosses into a new Voronoi cell. That discreteness is the whole point: a VAE interpolates smoothly; a VQ-VAE teleports.
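The "teleport" behavior can be seen numerically. A minimal sketch with a hypothetical 1-D codebook of three entries: the assigned code index is constant inside each Voronoi cell and jumps only at cell boundaries.

```python
import numpy as np

# Hypothetical 1-D codebook with three embedding "vectors".
codebook = np.array([0.0, 1.0, 2.0])

def quantize(z_e):
    """Snap z_e to the nearest codebook entry; return its index."""
    return int(np.argmin(np.abs(codebook - z_e)))

# Sweep z_e slowly from 0.0 to 1.0: the assigned code is constant inside
# a Voronoi cell, then jumps the instant z_e crosses the cell boundary
# (the midpoint 0.5 between codes 0.0 and 1.0).
indices = [quantize(z) for z in np.linspace(0.0, 1.0, 11)]
```

Only two distinct indices ever appear on this sweep, and the switch happens at a single step rather than gradually, which is exactly the discontinuity the straight-through estimator below has to work around.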
Mathematical Formulation
Encoder (deterministic): \[z_e = E_\phi(x) \in \mathbb{R}^D\]
Quantization (non-differentiable): \[z_q = e_k, \quad k = \arg\min_j \|z_e - e_j\|_2\]
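The argmin lookup above is a plain nearest-neighbor search over the codebook rows. A sketch in numpy, with a hypothetical codebook of K = 8 embeddings of dimension D = 4:

```python
import numpy as np

# Nearest-neighbor lookup: z_q = e_k with k = argmin_j ||z_e - e_j||_2.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))   # rows are e_1 .. e_K (hypothetical values)
z_e = rng.normal(size=(4,))          # encoder output E_phi(x)

dists = np.linalg.norm(codebook - z_e, axis=1)  # ||z_e - e_j||_2 for each j
k = int(np.argmin(dists))                       # index of the nearest code
z_q = codebook[k]                               # z_q = e_k
```

In practice the encoder emits a grid of D-dimensional vectors (one per spatial position) and this lookup is applied independently at every position.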
Loss (three terms): \[\mathcal{L} = \underbrace{\|x - \hat{x}\|^2}_{\text{reconstruction}} + \underbrace{\|\text{sg}(z_e) - z_q\|^2}_{\text{codebook}} + \beta\underbrace{\|z_e - \text{sg}(z_q)\|^2}_{\text{commitment}}\]
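A numeric sketch of the three terms, with hypothetical values for x, its reconstruction, z_e, and z_q. Numpy has no autograd, so sg(·) here only marks intent: the codebook and commitment terms have the same forward value and differ only in which side receives gradients.

```python
import numpy as np

beta = 0.25                     # commitment weight (a common choice)
x     = np.array([1.0, 2.0])    # hypothetical input
x_hat = np.array([0.9, 2.1])    # hypothetical reconstruction
z_e   = np.array([0.4, 0.6])    # encoder output
z_q   = np.array([0.5, 0.5])    # nearest codebook vector

recon      = np.sum((x - x_hat) ** 2)   # trains encoder + decoder
codebook_l = np.sum((z_e - z_q) ** 2)   # sg(z_e): moves e_k toward the encoder output
commit     = np.sum((z_e - z_q) ** 2)   # sg(z_q): keeps z_e close to its code
loss = recon + codebook_l + beta * commit
```

The codebook term updates only the embeddings e_k; the commitment term updates only the encoder, discouraging z_e from drifting away from the code it maps to.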
Straight-through estimator (backward pass only): \[\frac{\partial \mathcal{L}}{\partial z_e} := \frac{\partial \mathcal{L}}{\partial z_q}\]
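The estimator is usually implemented with the identity trick z_q = z_e + sg(z_q − z_e): forward, this equals z_q exactly; backward, the sg(·) term contributes nothing, so the gradient flows to z_e as if the quantizer were the identity. A framework-free sketch, where stop_gradient is a placeholder for the autodiff operation (e.g. a tensor detach):

```python
import numpy as np

def stop_gradient(v):
    # Placeholder: in an autodiff framework this blocks gradient flow;
    # as a pure function it is the identity, which is all the forward
    # pass needs.
    return v

z_e = np.array([0.4, 0.6])   # encoder output (hypothetical)
z_q = np.array([0.5, 0.5])   # its nearest codebook vector (hypothetical)

# Forward value is exactly z_q; backward, d(z_q_st)/d(z_e) = 1,
# i.e. dL/dz_e := dL/dz_q as in the equation above.
z_q_st = z_e + stop_gradient(z_q - z_e)
```

This is why the decoder sees clean discrete codes while the encoder still receives a usable gradient signal through the non-differentiable argmin.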